System and method for developing a risk profile for an internet service

ABSTRACT

A method and system for controlling access to an Internet resource is disclosed herein. When a request for an Internet resource, such as a Web site, is transmitted by an end-user of a LAN, a security appliance for the LAN analyzes a reputation index for the Internet resource before transmitting the request over the Internet. The reputation index is based on a plurality of factors for the Internet resource. A client application's access to the Internet resource can be allowed or denied based on the reputation index of the Internet resource.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of, and claims a benefit of priority under 35 U.S.C. § 120 from U.S. patent application Ser. No. 13/888,341, filed May 6, 2013, entitled “SYSTEM AND METHOD FOR DEVELOPING A RISK PROFILE FOR AN INTERNET RESOURCE,” which is a continuation of, and claims a benefit of priority under 35 U.S.C. § 120 from U.S. patent application Ser. No. 12/709,504, filed Feb. 21, 2010, entitled “SYSTEM AND METHOD FOR DEVELOPING A RISK PROFILE FOR AN INTERNET RESOURCE,” issued as U.S. Pat. No. 8,438,386, which claims a benefit of priority to U.S. Provisional Patent Application No. 61/241,389, filed Sep. 10, 2009, entitled “SYSTEM AND METHOD FOR DEVELOPING A RISK PROFILE FOR AN INTERNET RESOURCE,” and also claims a benefit of priority to U.S. Provisional Patent Application No. 61/171,264, filed Apr. 21, 2009, entitled “SYSTEM AND METHOD FOR DEVELOPING A RISK PROFILE FOR AN INTERNET RESOURCE,” all of which are fully incorporated by reference herein for all purposes.

TECHNICAL FIELD

The present invention is related to assessing risk profiles of Internet resources. More specifically, the present invention is related to a system and method for developing a risk profile for an Internet resource by generating a reputation index, based on attributes of the resource collectively referred to as the reputation vector of the resource.

BACKGROUND OF THE RELATED ART

Management of internet access, particularly to Web sites, has been accomplished in the past using “Content Filtering”, where Web sites are organized into categories and requests for Web content are matched against per-category policies and either allowed or blocked. This type of management focuses on the subject matter of a Web site, and provides visibility into, for example, how employees spend their time, and their company's network bandwidth usage, during the course of the day. These solutions also allow companies to enforce established internet usage policy (IUP) by blocking Web sites whose subject matter violates their IUP.

Security solutions, such as anti-virus products, examine file or Web page content to discover known patterns or signatures that represent security threats to users, computers, or corporate networks. These focus not on the subject matter of a site, but look for viruses and other ‘malware’ that are currently infecting the site. However, current solutions to management of Internet resources fail to measure the security risk associated with accessing an Internet resource in a more predictive way, before infections are isolated and signatures are identified and distributed.

A possible analogy to the reputation of an Internet resource is the credit score of an individual. A Web user would want to be informed of the reputation of a Web site before visiting it, just as a lender would want to know the reputation, the financial reputation at least, of a borrower of the lender's money.

A credit score is based on a variety of fairly tightly related factors, such as existing debt, available credit lines, on-time payments, existing credit balances, etc.

In the United States, a credit score is a number based on a statistical analysis of a person's credit files that represents the creditworthiness of that person, which is the likelihood that the person will pay their bills. A credit score is primarily based on credit information, typically from one of the three major credit agencies.

There are different methods of calculating credit scores. The best known one, FICO, is a credit score developed by the Fair Isaac Corporation. FICO is used by many mortgage lenders that use a risk-based system to determine the possibility that the borrower may default on financial obligations to the mortgage lender.

FICO® scores are provided to lenders by the three major credit reporting agencies: Equifax, Experian and TransUnion. When lenders order a borrower's credit report, they can also buy a FICO® score that is based on the information in the report. That FICO® score is calculated by a mathematical equation that evaluates many types of information from the borrower's credit report at that agency. In order for a FICO® score to be calculated on the borrower's credit report, the report must contain sufficient information, and sufficient recent information, on which to base a score. Generally, that means the borrower must have at least one account that has been open for six months or longer, and at least one account that has been reported to the credit reporting agency within the last six months.

FICO® scores provide a reliable guide to future risk based solely on credit report data. FICO® scores range from 300 to 850. The higher the score, the lower the risk. But no score says whether a specific individual will be a “good” or “bad” customer. And while many lenders use FICO® scores to help them make lending decisions, each lender has its own strategy to determine if a potential borrower is a good customer. Although FICO won't reveal exactly how it determines a credit score, it considers the following factors: payment history (35%); outstanding debt (30%); length of credit history (15%); types of credit (10%); and new credit (10%).

Returning to Internet resources, attackers have been using the Internet to attack the computers and other devices of users of the Internet. Attackers continue to take advantage of flaws in traditional security measures and bypass reputation-based systems to increase attack effectiveness.

In 2008, massive attacks were conducted that compromised hundreds of thousands of legitimate Web sites with good reputations worldwide with data-stealing malicious code. The attacks included sites from MSNBC, ZDNet, Wired, the United Nations, a large UK government site, and more. In the attacks, when a user's browser opened one of the thousands of compromised sites, a carefully crafted iframe HTML tag redirected users to a malicious site rife with exploits. As a result, malicious code, designed to steal confidential information, was launched on vulnerable machines. In addition to Web exploits, email spammers are also taking advantage of the reputation of popular email services like Yahoo! and Gmail to bypass anti-spam systems.

Also, spammers use sophisticated tools and bots to break the “CAPTCHA” systems that were developed to keep email and other services safe from spammers and other malicious activity. MICROSOFT Live Mail, GOOGLE's popular Gmail service and Yahoo! mail services were all compromised by this breakthrough method. Subsequently, spammers have been able to sign up for the free email accounts on a mass basis and send out spam from email accounts with good reputations. With a free signup process, access to a wide portfolio of services and domains that are unlikely to be blacklisted given their reputation, spammers have been able to launch attacks on millions of users worldwide while maintaining anonymity.

Thus, prior art solutions have focused on security when accessing known infected sites on the Internet from a network such as a local area network or a wide area network.

Hegli et al., U.S. Pat. No. 7,483,982 for “Filtering Techniques for Managing Access to Internet Sites or Other Software Applications” discloses a system and method for controlling an end user's access to the Internet by blocking certain categorized sites or limiting access based on bandwidth usage.

Hegli et al., U.S. Pat. No. 6,606,659 for a “System and Method for Controlling Access to Internet Sites” discloses a system and method for controlling an end user's access to the Internet by blocking certain categorized sites or limiting the number of times the end user can access an Internet site.

Yavatkar et al., U.S. Pat. No. 6,973,488 for “Providing Policy Information to a Remote Device” discloses a method for distributing high level policy information to remote network devices using a low-level configuration.

Turley et al., U.S. Patent Publication Number 2005/0204050 for a “Method and System for Controlling Network Access” discloses a system and method for controlling access to a specific site by using a gateway that assigns incoming traffic to specific sections of the site.

Shull et al., U.S. Pat. No. 7,493,403 for “Domain Name Validation” discloses accessing domain name registries to determine the ownership of a domain and monitoring the domain and registry.

Roy et al., U.S. Pat. No. 7,406,466 for a “Reputation Based Search” discloses using a search engine to present search results associated with measures of reputation to overcome the problem of META tags skewing the search results.

Hailpern et al., U.S. Pat. No. 7,383,299 for a “System and Method for Providing Service for Searching Web Site Addresses” discloses.

Moore et al., U.S. Pat. No. 7,467,206, for a “Reputation System for Web Services” discloses a system and method for selecting a Web service from a search engine list which is ranked based on reputation information for each Web service.

Definitions for various terms are set forth below.

FTP or File Transfer Protocol is a protocol for moving files over the Internet from one computer to another.

HyperText Markup Language (HTML) is a method of mixing text and other content with layout and appearance commands in a text file, so that a browser can generate a displayed image from the file.

Hypertext Transfer Protocol (HTTP) is a set of conventions for controlling the transfer of information via the Internet from a Web server computer to a client computer, and also from a client computer to a Web server.

Internet is the worldwide, decentralized totality of server computers and data-transmission paths which can supply information to a connected and browser-equipped client computer, and can receive and forward information entered from the client computer.

JavaScript is an object-based programming language. JavaScript is an interpreted language, not a compiled language. JavaScript is generally designed for writing software routines that operate within a client computer on the Internet. Generally, the software routines are downloaded to the client computer at the beginning of the interactive session, if they are not already cached on the client computer. JavaScript is discussed in greater detail below.

Parser is a component of a compiler that analyzes a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. XML Parsers ensure that an XML document follows the rules of XML markup syntax correctly.

URL or Uniform Resource Locator is an address on the World Wide Web.

Web-Browser is a complex software program, resident in a client computer, that is capable of loading and displaying text and images and exhibiting behaviors as encoded in HTML (HyperText Markup Language) from the Internet, and also from the client computer's memory. Major browsers include MICROSOFT INTERNET EXPLORER, NETSCAPE, APPLE SAFARI, MOZILLA FIREFOX, and OPERA.

Web-Server is a computer able to manage many Internet information-exchange processes at the same time. Normally, server computers are more powerful than client computers, and are administratively and/or geographically centralized. An interactive-form information-collection process generally is controlled from a server computer, to which the sponsor of the process has access. Servers usually contain one or more processors (CPUs), memories, storage devices and network interface cards. Servers typically store the HTML documents and/or execute code that generates Web-pages that are sent to clients upon request.

World Wide Web Consortium (W3C) is an unofficial standards body which creates and oversees the development of web technologies and the application of those technologies.

XHTML (Extensible Hypertext Markup Language) is a language for describing the content of hypertext documents intended to be viewed or read in a browser.

XML (Extensible Markup Language) is a W3C standard for text document markup, and it is not a language but a set of rules for creating other markup languages.

The prior art fails to provide solutions to the problems with accessing the Internet.

SUMMARY OF THE DISCLOSURE

The present invention provides a predictive approach based on a statistical model built on a broad sampling of Internet resources with varying degrees of risk. The present invention focuses on the reputation of a Web site, or any Internet-based service or resource. The reputation incorporates many factors that are relevant to the overall safety of visiting a site. The reputation assesses the over-time track record of the site and the provider that operates the web site, the current characteristics of the pages and related files composing the site, and reputations of sites linked to the site and of referrers to the site. The overall assessment is expressed as a score, not unlike a FICO score, that predicts the potential risk of visiting the site which can be used to protect users from inadvertently visiting or utilizing higher-risk sites or services within the Internet.

There are many components of reputation available within the Internet. Much like other scoring mechanisms, such as credit scoring, the factors to be considered must be decided upon, and the weight that each factor will have in the overall “score” must be determined.

The present invention provides a system and method for defining a reputation of an Internet service such as a Web site.

A basic element of reputation is how long a domain has been registered to a particular company/entity. In addition, a domain which frequently changes hands is viewed negatively with respect to reputation.

Preferred steps of the invention are: evaluation of the important features to be included in the collection of reputation-relevant features referred to as the reputation vector; collection of the reputation vectors for a large sample of Internet resources; training of a classifier based on training sets of known high and low reputation services/sites; testing of a model against a wide variety of random samples; run-time evaluation of requests for the Internet resource using the developed classifier; and responding to reputation index information requests from clients which enforce network security policy.

The present invention preferably protects users against threats which typically are not related to the subject matter of the service, or site. The present invention preferably protects users and networks from zero-day threats which have not been characterized or included in anti-virus signature files. The present invention preferably allows network managers to protect users and infrastructure without having to restrict access to particular categories of content. The present invention preferably allows higher security which is independent of cultural or moral biases related to many categories of content.

One aspect of the present invention is a method for controlling access to a Web site. The method includes transmitting a request for a Web site from a browser on a client-side device of a local area network. The Web site resides at a first server. The method also includes receiving the request for the Web site at a security appliance of the local area network prior to transmission of the request over the Internet. The method also includes analyzing a reputation vector for the Web site at the security appliance. The reputation vector includes a plurality of factors for the Web site comprising at least one or more of country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. The method also includes generating a reputation index for the Web site based on the analysis of the plurality of factors. The method also includes determining if the reputation index for the Web site is above a threshold value established for the local area network. The method also includes transmitting a decision transmission to the browser of the client-side device.

If the reputation index for the Web site is above the threshold value, the method further includes transmitting the request for the Web site over the Internet to a server for the Web site and receiving a Web page for the Web site at the local area network. In this situation, the decision transmission is the Web page for the Web site. If the reputation index for the Web site is at or below the threshold value, the decision transmission is a Web page from the local area network.
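By way of non-limiting illustration, the threshold comparison described above might be sketched in Python as follows. The names `Decision`, `decide`, `fetch_page` and `block_page` are hypothetical and are introduced here only for illustration; the numeric scale of the reputation index is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Decision:
    """Hypothetical container for the decision transmission."""
    allowed: bool
    body: str  # page returned to the browser


def decide(reputation_index: float,
           threshold: float,
           fetch_page: Callable[[], str],
           block_page: str) -> Decision:
    """Forward the request only when the index is above the threshold.

    An index at or below the threshold yields a page served from the
    local area network, and the remote server is never contacted.
    """
    if reputation_index > threshold:
        # The decision transmission is the Web page of the requested site.
        return Decision(True, fetch_page())
    # The decision transmission is a page from the local area network.
    return Decision(False, block_page)
```

In this sketch the remote fetch is passed in as a callable so that a denied request never leaves the local area network.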

The method can further include obtaining the plurality of factors for the Web site. Obtaining the plurality of factors for the Web site comprises accessing the Web site and analyzing a plurality of HTML documents for the Web site by crawling the Web site. Accessing the Web site comprises rendering a page for the Web site. Analyzing the plurality of HTML documents comprises determining the JavaScript block count and the picture count of each of the HTML documents.
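As a minimal sketch of how the JavaScript block count and the picture count of a single HTML document might be determined, the following Python example uses the standard-library `html.parser` module. It assumes that each `script` element counts as one JavaScript block and each `img` element counts as one picture; the disclosure does not define these counts at that level of detail, and the names `FeatureCounter` and `count_features` are hypothetical.

```python
from html.parser import HTMLParser


class FeatureCounter(HTMLParser):
    """Counts script and img start tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.script_blocks = 0
        self.pictures = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag == "script":
            self.script_blocks += 1
        elif tag == "img":
            self.pictures += 1


def count_features(html: str) -> dict:
    """Return the two counts for one HTML document (assumed definitions)."""
    counter = FeatureCounter()
    counter.feed(html)
    counter.close()
    return {
        "javascript_block_count": counter.script_blocks,
        "picture_count": counter.pictures,
    }
```

A crawler would call `count_features` once per retrieved document and store the results in the reputation vector for the site.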

Another aspect of the present invention is a system for controlling access to a Web site. The system includes a network, a Web site and a local area network. The network is the Internet.

The Web site is hosted at a first server and accessible over the Internet. The local area network includes a plurality of client-side devices and a security appliance. Each of the client-side devices has a browser. The security appliance controls access to the Internet by each of the plurality of client-side devices. The security appliance has a service engine for analyzing a reputation vector for the Web site and generating a reputation index for the Web site from the reputation vector. The reputation vector is based on a plurality of factors for the Web site. The plurality of factors comprises at least one or more of country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. Access to the Web site by any of the plurality of client-side devices is determined based on the reputation index exceeding a threshold value established for the local area network.

Another aspect of the present invention is a method for controlling access to an Internet resource utilizing a reputation generating site. The method includes transmitting a request for an Internet resource from a browser for a client-side device of a local area network. The Internet resource resides at a first server. The method also includes receiving the request for the Internet resource at the reputation generating site prior to transmission of the request over the Internet to the first server. The method also includes analyzing a reputation vector for the Internet resource at the reputation generating site. The reputation vector includes a plurality of dimensions for the Internet resource comprising at least two of country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. The method also includes generating a reputation index for the Internet resource based on the analysis of the plurality of factors. The method also includes determining if the reputation index for the Internet resource is above a threshold value established for the local area network. The method also includes transmitting a decision transmission to the browser of the client-side device.

Another aspect of the present invention is a method for controlling access to an Internet resource. The method includes transmitting a request for an Internet resource from an Internet-enabled client application from a client-side device of a local area network. The Internet resource resides at a first server. The method also includes receiving the request for the Internet resource at a security appliance of the local area network prior to transmission of the request over the Internet. The method also includes determining if a reputation index for the Internet resource is at or above a threshold value established for the local area network. The reputation index is generated from a reputation vector for the Internet resource. The reputation vector comprises a plurality of factors for the Internet resource comprising security history, legitimacy, behavior, associations and location. The reputation index preferably resides in a database file at the security appliance, which is immediately accessible by the security appliance for determining whether or not to allow access to the Internet resource. Alternatively, the reputation index is generated in real-time at a data collection site accessible by the security appliance over the Internet, and the reputation index is forwarded to the security appliance from the data collection site upon request. The method also includes transmitting a decision transmission to the Internet-enabled client application of the client-side device. The decision transmission allows or denies access to the Internet resource.
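The two sources of the reputation index described in this aspect, a database file at the security appliance and a data collection site reached over the Internet, might be combined as in the following Python sketch. Using the remote site only when the local database has no entry is an assumption made for illustration, since the disclosure presents the two as alternatives. The names `get_reputation_index`, `allow_access`, `local_db` and `fetch_remote` are hypothetical.

```python
from typing import Callable, Dict


def get_reputation_index(resource: str,
                         local_db: Dict[str, float],
                         fetch_remote: Callable[[str], float]) -> float:
    """Return the index from the local database, else from the remote site."""
    if resource in local_db:
        # Immediately accessible at the security appliance.
        return local_db[resource]
    # Assumed fallback: ask the data collection site upon request.
    return fetch_remote(resource)


def allow_access(resource: str,
                 threshold: float,
                 local_db: Dict[str, float],
                 fetch_remote: Callable[[str], float]) -> bool:
    """Allow access when the index is at or above the LAN threshold."""
    return get_reputation_index(resource, local_db, fetch_remote) >= threshold
```

Note that this aspect allows access when the index is at or above the threshold, so the comparison here is inclusive.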

Yet another aspect of the present invention is a method for controlling access to an Internet resource. The method includes transmitting a request for an Internet resource from a Web browser for a client-side device of a local area network. The Internet resource resides at a first server. The method also includes receiving the request for the Internet resource at a security appliance of the local area network prior to transmission of the request over the Internet. The method also includes constructing a reputation vector for the Internet resource at the security appliance. The reputation vector comprises a plurality of factors for the Internet resource comprising security history, legitimacy, behavior, associations and location. The method also includes analyzing the reputation vector to generate a reputation index for the Internet resource based on the analysis of the plurality of factors and the reputation classifier. The method also includes determining if the reputation index for the Internet resource is at or above a threshold value established for the local area network. The method also includes transmitting a decision transmission to the Web browser of the client-side device. The decision transmission allows or denies access to the Internet resource.

Yet another aspect of the present invention is a method for building a reputation database for Internet resources. The method includes collecting a plurality of factors for an Internet resource site to populate a reputation vector for the Internet resource to perform reputation analysis of the Internet resource. The method also includes receiving the plurality of factors for the Internet resource at a data collection site. The method also includes constructing a reputation vector for the Internet resource at the data collection site. The reputation vector comprises a plurality of factors for the Internet resource comprising security history, legitimacy, behavior, associations and location. The method also includes analyzing the reputation vector to generate a reputation index for the Internet resource based on the analysis of the plurality of factors and the reputation classifier. The method also includes storing the reputation index for the Internet resource at the data collection site. The method also includes transmitting the stored reputation index to a local area network upon request for managing access to the Internet resource.

The method further includes weighting each of the plurality of factors based on empirical knowledge of each of the plurality of factors. The method further includes obtaining the plurality of factors for the Internet resource using a crawler. Obtaining the plurality of factors for the Internet resource preferably comprises accessing the Internet service, analyzing a plurality of HTML documents for the Internet resource, and crawling a plurality of linked Internet resources of the plurality of HTML documents for the Internet resource. Analyzing the plurality of HTML documents preferably comprises determining the JavaScript block count and the picture count of each of the HTML documents, browser hijacking, file downloads and a subject matter.

Yet another aspect of the present invention is a method for controlling access to an Internet resource. The method includes collecting a first plurality of Internet resource reputation vectors. The method also includes partitioning the first plurality of Internet resource reputation vectors into a plurality of training sets. The method also includes training a maximum entropy discrimination classifier with the plurality of training sets, the maximum entropy discrimination classifier trained for a specific local area network. The method also includes testing the trained maximum entropy discrimination classifier using a second plurality of Internet resource reputation vectors. Each of the second plurality of Internet resource reputation vectors is unknown to the trained maximum entropy discrimination classifier. The method also includes evaluating the tested maximum entropy discrimination classifier. The method also includes providing feedback to the evaluated maximum entropy discrimination classifier. The method also includes utilizing the reputation index at a local area network for managing access to an Internet resource.

Preferably, each of the first plurality of Internet resource reputation vectors comprises a plurality of dimensions for the Internet resource comprising security history, legitimacy, behavior, associations and location, and the method further comprises weighting each of the plurality of dimensions.

Yet another aspect of the present invention is a method for training a MED classifier for controlling access to an Internet resource. The method includes collecting a plurality of reputation vectors for Internet resources. The method also includes partitioning the plurality of reputation vectors into training sets. The method also includes training a MED classifier with the training sets. The method also includes testing the trained MED classifier against unknown Internet resources. The method also includes evaluating the trained MED classifier. The method also includes determining if the trained MED classifier has been adequately trained.
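The collect, partition, train, test, evaluate and adequacy-check sequence recited above can be sketched in Python as follows. This sketch deliberately does not implement maximum entropy discrimination; a nearest-centroid classifier is swapped in as a simple stand-in so that the workflow is runnable with the standard library alone. All function names, the 80/20 split and the 0.9 accuracy target are assumptions made for illustration.

```python
import random


def partition(samples, train_fraction=0.8, seed=0):
    """Shuffle labeled reputation vectors and split them into two sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train_centroids(training_set):
    """Stand-in for MED training: compute the mean vector of each label."""
    sums, counts = {}, {}
    for vector, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}


def classify(model, vector):
    """Assign the label whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def evaluate(model, test_set):
    """Fraction of previously unseen vectors that are labeled correctly."""
    correct = sum(1 for vector, label in test_set
                  if classify(model, vector) == label)
    return correct / len(test_set)


def adequately_trained(accuracy, target=0.9):
    """Assumed adequacy rule: accuracy on unseen data meets the target."""
    return accuracy >= target
```

If `adequately_trained` returns false, the feedback step described above would presumably adjust the features or training sets and repeat the cycle.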

Having briefly described the present invention, the above and further objects, features and advantages thereof will be recognized by those skilled in the pertinent art from the following detailed description of the invention when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system for controlling access to a Web site.

FIG. 2 is a block diagram of a system for controlling access to a Web site.

FIG. 3 is a flow chart of a method for controlling access to a Web site.

FIG. 4 is a Web page for a requested Web site.

FIG. 5 is a page for a local area network informing a requestor of the denial of access to a Web site.

FIG. 6 is a block diagram of an Internet resource having an HTML document that is accessed by a crawler.

FIG. 7 is a flow chart of a method for generating a reputation index.

FIG. 8 is a flow chart of a method for controlling access to an Internet resource.

FIG. 9 is a flow chart of a method for utilizing a MED classifier for controlling access to an Internet resource.

FIG. 10 is a flow chart of a method for controlling access to an Internet resource utilizing a MED classifier.

FIG. 11 is a block diagram of a system for utilizing a MED classifier for controlling access to an Internet resource.

DETAILED DESCRIPTION

Reputation is a qualitative assessment of the safety of a Web site, expressed as a quantitative value that can be used in managing internet usage. An Internet resource, such as a Web site, is safe, or of high reputation, if the Internet resource preferably: has a reputable ownership and registration; has a consistent history; has had consistent content during that history; is associated with other high reputation sites; is from a geographically safe region; has an Internet service provider (“ISP”) that is well-known and reputable; has not been known to be a source of malware infection; and has worked cooperatively with the end-user and the end-user's Web browser application.

While security threats are transitory since they come up suddenly and are mitigated as quickly as possible, reputation is built up over a period of time and is a more enduring quality. Reputation can be lost, or become ‘bad’, over a period of time with repeated security events, bad associations, and bad behavior. For that reason, the occurrence of a single security breach (the site gets hacked and is a danger to visitors) does not dramatically lower the reputation of a site. Repeated occurrences over time, however, will destroy the reputation of the site.

Competitive reputation products include social considerations in their definitions, such that a highly reputable site, a site “held in high regard”, preferably has these characteristics: established record of Web presence; not a source of network security risk; no introduction of malware; no popup ads; no persistent ad infection; is not pornographic or obscene; and has no illegal content.

The reputation of an Internet resource is preferably determined by security, legitimacy, behavior, geographic location, associations and additional factors. Legitimacy is determined by the top-level domain, the investment in the Internet resource (virtual hosting with non-affiliated sites, multiple hosting and SSL security), the traffic volume, the category age and the popularity rank. Legitimacy is also preferably determined by any or all of the following: the consistency between the registering and hosting city, region or country; and the city, region or country associated with the IP address. Behaviors include the use of popup ads, browser hijacking and the use of auto-redirecting. Associations include the number of sites linking into the site, the reputations of the linked-in sites and the reputations of the linked-to sites. The geographic location includes the hosting country, the registration country, the region and the city. The geographic location also preferably includes the consistency between the registering and hosting country and the country associated with the IP address.

In a most preferred embodiment discussed below, machine learningtechnologies are utilized for controlling access to an Internetresource. A variation on support vector machine techniques calledMaximum Entropy Discrimination (“MED”) is a preferred machine learningtechnology. MED allows a computer to be trained to recognize therelative reputation of an Internet resource based on the features of theInternet resource. The set of features which characterize the reputationof the Internet resource is its reputation vector. Once trained, thecomputer uses the reputation vector for a requested Internet service toevaluate its reputation index, a score which can be used withempirically developed threshold values to block access where thereputation index is deemed to be too low to be safe.

A predictive security assessment for an Internet resource is provided based on known facts about the Internet resource, which is more secure than relying only on knowledge of previously experienced security attacks.

The system preferably provides classification of each Internet resource at run-time given a Uniform Resource Identifier (URI) and the reputation vector of the Internet resource. The system returns a score, or index, expressing the results on a relative scale for use by requesting clients, typically a security product which integrates the reputation assessment as a service.

The reputation vector preferably comprises a combination of some or all of the following: country of domain registration; country of service hosting; country of IP Address block; age of domain registration; time known to the assessor site; subject matter; classification age (time since last re-categorization); rank (popularity); IP Address; virtual hosting; number of hosts; top-level domain (.com, .biz, .ru, etc.); security history; run-time behaviors; popup ads; downloadable executables; virus-infected executables; JavaScript block count; picture count; immediate redirect; and response latency. These features are collected and evaluated for all model training samples and at run-time on a per-user-request basis. Those skilled in the pertinent art will recognize that other factors may be utilized which are relevant to the security as determined by an assessor.

As shown in FIG. 1, a system for controlling access to an Internet service is generally designated 20. The system 20 preferably comprises a local area network 30, the Internet 100 and an Internet service located at a remote server 60. The Internet resource is preferably a Web site. A local area network 30 preferably comprises a security appliance 50 and a plurality of client-side devices 55. The plurality of client-side devices preferably comprises desktop computers, laptop computers, personal digital assistants, smartphones and the like. Each of the client-side devices 55 preferably has a Web-browser for accessing the Internet from the client-side device 55. The security appliance 50 preferably comprises a network access 51 for permitting access to the Internet from the local area network 30, and a service engine 52 for determining if a requested Internet resource has a reputation index that meets a threshold established for the local area network 30.

A method 1000 for controlling access to a Web site is shown in FIG. 3. At block 1001, a request for a Web site is transmitted from a browser for a client-side device of a local area network which is received at a security appliance of the local area network prior to transmission of the request over the Internet. At block 1002, a reputation index for the Web site is obtained at the security appliance. The reputation index is calculated from a reputation vector which preferably includes a plurality of factors for the Web site comprising country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. At block 1004, a determination is made if the reputation index for the Web site is above a threshold value established for the local area network. At decision 1005, if the reputation index is not above the threshold, then at block 1006 access to the Web site is denied and a transmission of the denial is sent to the client-side device, preferably as a page 500 as shown in FIG. 5. If at decision 1005 the reputation index for the Web site is above the threshold, then the access to the Web site by the client-side device is permitted by the security appliance, and preferably, as shown in FIG. 4, a Web page 400 is provided to the client-side device.

An alternative embodiment of the system 20 is illustrated in FIG. 2. The system 20 preferably comprises a local area network 30, the Internet 100, an Internet service located at a remote server 60 and a reputation generating site 70 preferably having a crawler 71 and a database 72. The Internet service is preferably a Web site. A local area network 30 preferably comprises a security appliance 50 and a plurality of client-side devices 55. Each of the client-side devices 55 preferably has a Web-browser for accessing the Internet from the client-side device 55. The security appliance 50 preferably comprises a network access 51 for permitting access to the Internet from the local area network 30, and a service engine 52 for determining if a requested Internet service has a reputation index that meets a threshold established for the local area network 30. The reputation generating site 70 provides reputation indices to the service engine 52 of the security appliance 50. The reputation generating site 70 preferably utilizes the crawler 71 and other means to access Internet resources such as the Internet resource located at Web server 60. The other means preferably include publicly available data feeds, purchased databases, proprietary databases and zone files from a WHOIS database.

A flow chart for a method 2000 for generating a reputation index is shown in FIG. 7. At block 2001, an HTTP request is transmitted from a reputation generating site 70 for an Internet resource. From the HTTP request, a crawler 71 of the reputation generating site accesses the Internet resource. In accessing the Internet resource, as shown in FIG. 6, the crawler 71 preferably accesses at least one HTML document 91 of a plurality of HTML documents of the Internet resource 90. At block 2003, from the HTML documents and links within the HTML documents, the crawler 71 obtains information concerning the Internet resource 90.

The reputation vector for the Internet resource 90 is based on some of this information obtained by the crawler 71. At block 2004, the reputation vector for the Internet resource is analyzed at the reputation generating site 70. At block 2005, a reputation index for the Internet resource 90 is generated at the data collection site. At block 2006, the reputation index for the Internet resource 90 is stored in a database 72 of the reputation generating site 70. The reputation for the Internet resource is available to the security appliance as updates or individual requests. At block 2007, the reputation index for the Internet resource 90 is transmitted to a LAN 30 for storage in a service engine 52 of a security appliance 50.

A flow chart for a method 3000 for controlling access to an Internet resource is shown in FIG. 8. At block 3001, a request for an Internet resource is transmitted from an Internet-enabled client application for a client-side device 55 of a LAN 30. At block 3002, the request is received at a security appliance 50 of the LAN 30 prior to transmission of the request over the Internet 100. At block 3003, a reputation index for the Internet resource is accessed from a database of a service engine 52 of the security appliance 50. The reputation index is based on a reputation vector which includes a plurality of factors for the Internet resource comprising at least two or more of country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, security history, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. At block 3004, a determination is made if the reputation index for the Internet resource is at or above a threshold value established for the LAN 30. At decision 3005, if the reputation index is below the threshold value, then at block 3006 access to the Internet resource is denied and a transmission of the denial is sent to the client-side device 55. If at decision 3005 the reputation index for the Internet resource is at or above the threshold value, then the access to the Internet resource by the client-side device 55 is permitted by the security appliance 50.
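The decision at blocks 3003 through 3006 can be illustrated with a minimal sketch. The names `control_access`, `Decision` and `deny_unknown`, as well as the host names and index values, are hypothetical and chosen only for illustration. The description does not state what happens when no reputation index is stored for a resource, so denying in that case is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Decision:
    allowed: bool
    reason: str


def control_access(resource: str,
                   index_db: Dict[str, int],
                   threshold: int,
                   deny_unknown: bool = True) -> Decision:
    """Compare a stored reputation index with the LAN threshold."""
    index = index_db.get(resource)  # block 3003: look up the stored index
    if index is None:
        # Assumption: the source does not cover a missing index.
        return Decision(not deny_unknown, "no reputation index stored")
    if index >= threshold:          # blocks 3004/3005: at or above threshold
        return Decision(True, f"index {index} >= threshold {threshold}")
    # block 3006: below threshold, so access is denied
    return Decision(False, f"index {index} < threshold {threshold}")


if __name__ == "__main__":
    db = {"good.example": 95, "bad.example": 51}  # illustrative values
    for host in ("good.example", "bad.example", "unknown.example"):
        print(host, control_access(host, db, threshold=70))
```

Because the comparison is "at or above", a resource whose index equals the threshold is permitted, which matches the wording of decision 3005.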

Table One provides a list of the attributes for the reputation vector and a description of each of the attributes.

TABLE ONE

ATTRIBUTE             DESCRIPTION
Country               2-letter code, 3-letter code or full name of country based on IP block
Top-level Domain      .com, .biz, .org, .gov, etc.
Domain Age            Number of months in existence on zone lists, or no less than the classification age
Database Age          Months since Authority was entered into database
Classification Age    Months that the Authority has held its current classification
Hosts                 Number of IP's associated with the Authority
Virtually Hosted      T/F if the other authorities share associated IP's
Popups                T/F if the page opens new browser windows on its own
Hijack                T/F does the default page alter the browser configuration
JavaScript            Count of <SCRIPT> blocks in default pages
Executables           T/F does the authorities download executables to client
Pictures              Count of pictures on default page
Latency               Number of milliseconds to return default page
Rank                  Numerical ranking, used as T if <2,000,000, F otherwise in modeling
Infected              T/F were infected download files found by AV tools during site analysis
Security Trend        Number of malware infections in past 12 months
Total Security Count  Total number of malware infections known
Redirect              Authority redirects to another Authority
IP Address            Analysis of IP address for known threat sources, reserved IP ranges, and legacy IP address assignments
ISP                   Internet service provider
City
Region
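As an illustration of how the attributes of Table One might be turned into numeric dimensions, the following sketch encodes a dictionary of raw attribute values. Only the Rank rule (T if below 2,000,000, F otherwise) comes from Table One. The function name `encode_attributes` and the choice to give each categorical value its own "name=value" dimension are assumptions made for this sketch.

```python
from typing import Dict, Union

RANK_CUTOFF = 2_000_000  # from Table One: Rank is used as T if < 2,000,000

RawValue = Union[bool, int, float, str]


def encode_attributes(raw: Dict[str, RawValue]) -> Dict[str, float]:
    """Map raw Table One attributes to a sparse numeric feature dictionary."""
    features: Dict[str, float] = {}
    for name, value in raw.items():
        if name == "Rank":
            # Rank enters the model as a true/false flag, not as a number.
            features[name] = 1.0 if value < RANK_CUTOFF else 0.0
        elif isinstance(value, bool):
            # T/F attributes such as Popups or Hijack.
            features[name] = 1.0 if value else 0.0
        elif isinstance(value, (int, float)):
            # Counts and durations such as Pictures or Latency.
            features[name] = float(value)
        else:
            # Assumption: each categorical value becomes its own dimension.
            features[f"{name}={value}"] = 1.0
    return features


if __name__ == "__main__":
    print(encode_attributes({"Rank": 1500, "Popups": True,
                             "Latency": 826, "Top-level Domain": ".cn"}))
```

The boolean check precedes the numeric check because Python treats `bool` as a subtype of `int`.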

Table Two is an example of a “good” Internet resource.

TABLE TWO

ATTRIBUTE         VALUE
Authority         USmoney.gov
Country           USA
Top Level Domain  Gov
Domain Age        18
Hosts             2
Virtual Hosts     0
Rank              1
Infected          0
Security Events   0
Recent Events     0
PublicCoIP        0
GovernmentIP      1
Hijack            0
JavaScript        0
Executables       0
Pictures          0
Latency           0
Redirect          0

Table Three is an example of a “bad” Internet resource.

TABLE THREE

ATTRIBUTE         VALUE
Authority         www.c.reditcan.cn
Country           CN
Top Level Domain  CN
Domain Age        3
Hosts             1
Virtual Hosts     1
Rank              0
Infected          0
Security Events   1
Recent Events     1
PublicCoIP        0
GovernmentIP      0
Hijack            0
JavaScript        13
Executables       1
Pictures          14
Latency           826
Redirect          0

Depending on the threshold value established by the administrator of the LAN, the Internet resource of www.c.reditcan.cn, with a reputation index value of 51, is not available for access by a user based on its reputation index, and the Internet resource of www.USmoney.GOV is available for access by a user based on its reputation index of 95. Thus, even if the Internet resource of www.c.reditcan.cn is not a known source of malware or viruses, the present invention would prevent an end user client from accessing the Internet resource since its reputation index is deemed unsafe.

Another embodiment uses a MED algorithm to build a statistical model on a Web page based on good and bad Internet samples. This embodiment uses a unique optimization algorithm for training, as well as two other optimization steps for calibrating the outputs to be probabilities, in a process that tolerates some input errors while still yielding reliable outputs. Training process feedback loops guide the implementer to improve the model data through splitting data into sets for holdout, training, and testing guided by two criteria: most violating examples, and least understood examples. The implementer using the criteria iteratively improves the quality of the training set, which also reduces classifier errors and is exponentially faster than having the implementer manually verify or check the example assignments to categories in random or haphazard order. The examples are randomly reassigned before every training iteration to improve generalization. Sparse matrix math during the classification process improves processing speeds to enable a modest computer to classify millions of URLs per day. The implementation allows for multiple dimensions, each representing a fact about the Internet resource, to be included in the reputation model, while classification speed of any particular Internet resource is independent of the number of total dimensions in its reputation vector.

This embodiment is preferred since classifying a large percentage of existing Web sites into reputation risk assessments quickly and efficiently requires an automated process because the number of humans required is too large to be practical or economical. Further, defining automated classification rules by hand is very hard and requires writing many thousands of extremely specific as well as vague rules. All of these rules will interact with each other in an exponential number of ways, making human-based rule construction a daunting effort as well. The machine learning approach of this embodiment solves the problem by having humans define “training sets” or examples of each topic for the classifier, which then “trains” by optimizing the weights each factor should have in order to reduce classification error the most.

In addition to providing a good implementation of the learning algorithm, this embodiment efficiently utilizes the human efforts in identifying examples of good and bad reputations.

This embodiment preferably applies an effective learning formulation based on the principles and theory of MED with an efficient optimization algorithm based on the principles of Platt's sequential minimal optimization (“SMO”), in conjunction with an overall optimization of tunable parameters and calibrated confidence scores, to solve the learning problem given proper examples of Web sites of good and bad reputation.

The process then involves having humans examine a list of “most violating” examples, the examples which were marked as being good reputations but received extremely low confidence scores from the classifier (and vice-versa), as well as “least understood” examples, the examples which receive a confidence score close to the prior probability of the reputation.

By spending human time examining these two classes of examples, the classifier benefits from having egregiously misclassified examples being put into the proper reputation (good or bad) as well as providing the classifier with the largest amount of new information as quickly as possible. This combination improves the classifier's real-world effectiveness very quickly with minimal human effort. Thus, this embodiment efficiently combines human and automated work to solve the problem of automated reputation classification of Internet resources.
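The two review lists can be sketched as simple rankings over classifier outputs. This is a minimal illustration, not the classifier itself: the tuple layout, the function names `most_violating` and `least_understood`, the sample host names and the prior of 0.5 are all assumptions. Each example carries a human-assigned label (1 for good, 0 for bad) and the classifier's confidence that the resource is good.

```python
from typing import List, Tuple

# (resource, human label: 1 = good / 0 = bad, classifier confidence of "good")
Example = Tuple[str, int, float]


def most_violating(examples: List[Example], k: int) -> List[Example]:
    """Examples whose confidence disagrees most strongly with their label."""
    return sorted(examples, key=lambda e: abs(e[1] - e[2]), reverse=True)[:k]


def least_understood(examples: List[Example], prior: float, k: int) -> List[Example]:
    """Examples whose confidence lies closest to the prior probability."""
    return sorted(examples, key=lambda e: abs(e[2] - prior))[:k]


if __name__ == "__main__":
    scored = [
        ("a.example", 1, 0.05),  # marked good, scored very low
        ("b.example", 0, 0.10),  # marked bad, scored low (consistent)
        ("c.example", 1, 0.52),  # score near the prior
        ("d.example", 0, 0.97),  # marked bad, scored very high
    ]
    print([e[0] for e in most_violating(scored, k=2)])
    print([e[0] for e in least_understood(scored, prior=0.5, k=1)])
```

A reviewer would then correct the labels of the first list where they were wrong and assign labels to the second list, before the next training iteration.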

In one method, an evaluation of multiple factors (such as discussed above) is included in determining a reputation vector for an Internet resource. This process is done for multiple Internet resources. Next, reputation vectors for a large sample of Internet resources are collected at a data collection site. Next, a MED classifier is trained using the collection of reputation vectors based on training sets of known high reputation Internet resources and low reputation Internet resources. Next, a MED-based model for classification is tested against a wide variety of random samples of Internet resources. Next, a security appliance is deployed at a LAN. Next, a run-time evaluation of Internet resource requests is performed using the developed MED classifier for responding to reputation index information requests from clients based on a LAN security policy. The MED-based model for classification is preferably utilized at run-time to calculate a reputation index. In this manner, this embodiment provides a predictive security assessment based on known facts about an Internet resource, which is more secure than relying only on knowledge of previously experienced security attacks. This embodiment provides a LAN with real-time updates, real-time classification of non-cached URLs and a real-time feedback loop.

A flow chart of a method 4000 for utilizing a MED classifier for controlling access to an Internet resource is shown in FIG. 9. At block 4001, multiple reputation vectors for a large sample of Internet resources are collected, preferably at a reputation generating site. The reputation vectors for the Internet resources are previously generated as discussed above. At block 4002, the reputation vectors are partitioned into multiple training sets. The training sets comprise at least two training sets divided into high reputation Internet resources and low reputation Internet resources. At block 4003, a MED classifier is trained using the training sets of high reputation Internet resources and low reputation Internet resources to create a trained MED classifier. At block 4004, the trained MED classifier is tested against a wide variety of Internet resources which are not grouped into training sets and the reputation index is unknown to the trained MED classifier. At block 4005, the tested MED classifier is evaluated to determine the accuracy of the tested MED classifier and to determine the most violating examples of either a wrongly categorized high reputation Internet resource or low reputation Internet resource, and the least understood Internet resources. At decision block 4006, an evaluation of the testing is performed. If the testing was performed correctly, then at block 4007 the MED classifier is considered trained and ready for operations. If the testing was inadequate, feedback is provided to the MED classifier concerning the wrongly categorized high reputation Internet resources or low reputation Internet resources, and the least understood Internet resources. The process is continued at block 4003 again until the MED classifier is properly trained.

In another embodiment, a reputation index is returned immediately from a stored set of reputation indexes calculated prior to the user's request. As shown in FIG. 10, a method for controlling access to an Internet resource utilizing a MED classifier is generally designated 5000. At block 5001, a request for an Internet resource is transmitted from an Internet-enabled client application for a client-side device 55 of a LAN 30. At block 5002, a reputation vector for the Internet resource is analyzed, preferably at a MED classifier or at a security appliance for the LAN. At block 5003, a reputation index for the Internet resource is accessed/generated from a database of a service engine 52 of the security appliance 50. The reputation index is preferably based on a reputation vector which includes a plurality of factors for the Internet resource comprising at least two or more of country of domain registration, country of service hosting, country of an internet protocol address block, age of a domain registration, security history, popularity rank, internet protocol address, number of hosts, top-level domain, a plurality of run-time behaviors, JavaScript block count, picture count, immediate redirect and response latency. At block 5004, a determination is made if the reputation index for the Internet resource is at or above a threshold value established for the LAN 30. At decision 5005, if the reputation index is below the threshold value, then at block 5006 access to the Internet resource is denied and a transmission of the denial is sent to the client-side device 55. If at decision 5005 the reputation index for the Internet resource is at or above the threshold value, then the access to the Internet resource by the client-side device 55 is permitted by the security appliance 50. In such an embodiment, a pre-calculated reputation index residing on the LAN or quickly available to the security appliance of the LAN provides for a much faster response (if not immediate response) as to the accessibility of the Internet resource.
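The benefit of a pre-calculated index can be illustrated with a small lookup sketch. The class name `ReputationCache`, the `classify` callback and the `misses` counter are hypothetical. Falling back to run-time classification for a resource that has no stored index follows the "accessed/generated" wording of block 5003, but storing the result for later requests is an assumption of this sketch.

```python
from typing import Callable, Dict


class ReputationCache:
    """Serve stored reputation indexes; classify only on a miss."""

    def __init__(self, precomputed: Dict[str, int],
                 classify: Callable[[str], int]) -> None:
        self._store = dict(precomputed)  # indexes calculated before any request
        self._classify = classify        # slower run-time classification
        self.misses = 0

    def lookup(self, resource: str) -> int:
        if resource in self._store:
            return self._store[resource]  # immediate response
        self.misses += 1
        index = self._classify(resource)
        self._store[resource] = index     # assumption: keep it for next time
        return index


if __name__ == "__main__":
    cache = ReputationCache({"known.example": 90}, classify=lambda r: 40)
    print(cache.lookup("known.example"), cache.misses)
    print(cache.lookup("new.example"), cache.misses)
    print(cache.lookup("new.example"), cache.misses)
```

The second request for the same uncached resource is served from the store, so the run-time classifier is invoked only once per resource.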

FIG. 11 illustrates a system 20 for controlling access to an Internet resource utilizing a MED classifier site 77. The system 20 preferably comprises a local area network 30, the Internet 100, a MED classifier site 77, and an Internet service located at a remote server 60. The Internet resource is preferably a Web site. A local area network 30 preferably comprises a security appliance 50 and a plurality of client-side devices 55. Each of the client-side devices 55 preferably has a Web-browser for accessing the Internet from the client-side device 55. The security appliance 50 preferably comprises a network access 51 for permitting access to the Internet from the local area network 30, based on data from the MED classifier site 77, which determines if a requested Internet resource has a reputation index that meets a threshold established for the local area network 30.

Table Four provides an example of some dimensions and the sorted model weights of the MED classifier.

TABLE FOUR

IDENTIFICATION  DIMENSION   SORTED MODEL WEIGHTS
1966272070      Domain age    3.785360
2307717         Gov           1.969750
1906396306      Paris         0.647784
1477426223      Hijack      −19.887100
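To show how weights such as those in Table Four could be combined with a reputation vector, the following sketch computes a raw weighted sum over only the dimensions present for a resource. The weights are copied from Table Four. The function name `raw_score`, the feature values, and the omission of any bias term or calibration step are assumptions, since the description does not give the feature scaling or the mapping from this sum to a reputation index.

```python
from typing import Dict

# Weights copied from Table Four, keyed by dimension name.
TABLE_FOUR_WEIGHTS: Dict[str, float] = {
    "Domain age": 3.785360,
    "Gov": 1.969750,
    "Paris": 0.647784,
    "Hijack": -19.887100,
}


def raw_score(features: Dict[str, float],
              weights: Dict[str, float] = TABLE_FOUR_WEIGHTS) -> float:
    """Sparse dot product: iterate only over the resource's own features."""
    # Dimensions with no learned weight contribute nothing.
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


if __name__ == "__main__":
    # Illustrative feature values only; real scaling is not given in the text.
    print(raw_score({"Gov": 1.0, "Domain age": 1.0}))
    print(raw_score({"Paris": 1.0, "Hijack": 1.0}))
```

Because the loop runs over the features of the one resource being scored, its cost depends on how many dimensions that resource has, not on the total number of dimensions in the model. The large negative weight on Hijack dominates the small positive weights, so a resource exhibiting browser hijacking receives a negative raw sum in this sketch.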

From the foregoing it is believed that those skilled in the pertinent art will recognize the meritorious advancement of this invention and will readily understand that while the present invention has been described in association with a preferred embodiment thereof, and other embodiments illustrated in the accompanying drawings, numerous changes, modifications and substitutions of equivalents may be made therein without departing from the spirit and scope of this invention, which is intended to be unlimited by the foregoing except as may appear in the following appended claims. Therefore, the embodiments of the invention in which an exclusive property or privilege is claimed are defined in the following appended claims.

What is claimed is:
1. A method for controlling access to web resources, the method comprising: receiving a request from a client application for an Internet resource; prior to transmission of the request for the Internet resource over the Internet from a local area network, comparing a reputation index for the Internet resource to a threshold value to generate a comparison result, the reputation index generated from a plurality of factors for the Internet resource, the plurality of factors comprising a top-level domain, an age of domain registration, a traffic volume, a detection of malicious code, a domain registration location, and uniform resource locator (URL) redirection, and an IP address; and allowing or denying the client application access to the Internet resource based on the comparison result.
2. The method of claim 1, wherein the plurality of factors includes a private Internet protocol (IP) address factor.
3. The method of claim 2, wherein the plurality of factors includes a government IP address factor.
4. The method of claim 1, wherein the plurality of factors includes a country of service hosting, a country of an internet protocol address block, a popularity rank, a number of hosts, a plurality of run-time behaviors, a script block count, a picture count, and a response latency.
5. The method for controlling access of claim 1, further comprising: maintaining a machine learning model representing the plurality of factors for a plurality of Internet resources and reputations for the plurality of Internet resources; and generating the reputation index for the Internet resource, wherein generating the reputation index comprises processing the plurality of factors for the Internet resource using the machine learning model.
6. The method of controlling access of claim 5, wherein the reputation index for the Internet resource is pre-generated prior to receiving the request from the client application for the Internet resource.
7. The method of controlling access of claim 6, further comprising providing the reputation index for the Internet resource to the local area network prior to receiving the request from the client application for the Internet resource.
8. The method of controlling access of claim 5, wherein the reputation index for the Internet resource is generated after receiving the request for the Internet resource from the client application.
9. The method of controlling access of claim 1, wherein allowing access to the Internet resource comprises transmitting the request for the Internet resource over the Internet to a server and transmitting a responsive web page to the client application.
10. The method of controlling access of claim 1, wherein denying access to the Internet resource comprises returning a web page from the local area network to the client application.
11. A system for controlling access to web resources, the system comprising: a processor; a memory storing instructions executable by the processor for: receiving, by the processor, a request from a client application for an Internet resource; prior to transmission of the request for the Internet resource over the Internet from a local area network, comparing a reputation index for the Internet resource to a threshold value to generate a comparison result, the reputation index generated from a plurality of factors for the Internet resource, the plurality of factors comprising a top-level domain, an age of domain registration, a traffic volume, a detection of malicious code, a domain registration location, and uniform resource locator (URL) redirection, and an IP address; allowing the client application access to the Internet resource based on the comparison result; and denying the client application access to the Internet resource based on the comparison result.
12. The system of claim 11, wherein the plurality of factors includes a private Internet protocol (IP) address factor.
13. The system of claim 12, wherein the plurality of factors includes a government IP address factor.
14. The system of claim 11, wherein the plurality of factors includes a country of service hosting, a country of an internet protocol address block, a popularity rank, a number of hosts, a plurality of run-time behaviors, a script block count, a picture count, and a response latency.
15. The system of claim 11, further comprising instructions executable by the processor for: maintaining a machine learning model representing the plurality of factors for a plurality of Internet resources and reputations for the plurality of Internet resources; and generating the reputation index for the Internet resource, wherein generating the reputation index comprises processing the plurality of factors for the Internet resource using the machine learning model.
16. The system of claim 15, wherein the reputation index for the Internet resource is pre-generated prior to receiving the request from the client application for the Internet resource.
17. The system of claim 16, wherein the processor comprises a first processor coupled between the local area network and the Internet and a second processor, and wherein the system further comprises instructions executable by the second processor for providing the reputation index for the Internet resource to the local area network prior to the first processor receiving the request from the client application for the Internet resource.
18. The system of controlling access of claim 15, wherein the reputation index for the Internet resource is generated after the processor receives the request for the Internet resource from the client application.
19. The system of claim 11, wherein allowing access to the Internet resource comprises transmitting the request for the Internet resource over the Internet to a server and transmitting a responsive web page to the client application.
20. The system of claim 19, wherein denying access to the Internet resource comprises returning a web page from the local area network to the client application.